ABSTRACT


Supervised machine learning has become one of the most im- 
portant methods for developing educational and intelligent 
tutoring software; it is the backbone of many educational 
data mining methods for estimating knowledge, emotion, 
and other aspects of learning. Hence, in order to ensure opti- 
mal utilization of computing resources and effective analysis 
of models, it is essential that researchers know which eval- 
uation metrics are best suited to educational data. In this 
article, we focus on the problem of wrapper feature selection, 
where predictors are added to models based on how much 
they improve model accuracy in terms of a given metric. 
We compared commonly-used machine learning algorithms 
including naive Bayes, support vector machines, logistic re- 
gression, and random forests on 11 diverse learning-related 
datasets. We optimized feature selection based on nine dif- 
ferent metrics, then evaluated each to address research ques- 
tions about how effective each metric was in terms of the 
others (e.g., does optimizing for precision also result in good 
F1?) as well as calibration (i.e., are predictions produced by 
models accurate probabilities of correctness?). We provide 
empirical evidence that the Matthews correlation coefficient 
(MCC) produced the overall best results across the other 
metrics, but that root mean squared error (RMSE) selected 
the best-calibrated models. Finally, we also discuss issues 
related to the number of features selected when optimizing 
for each metric, as well as the types of datasets for which 
certain metrics were more effective.


1. INTRODUCTION


Machine learning is a popular method for building predic- 
tive models that automatically estimate various aspects of 
learning. These models, in turn, can be applied to study 
the processes of learning or teaching, or to automatically
guide students as they learn. Training models is a complex
process, however. The space of possible machine learning 
models is far too large to fully explore, and thus the search 
space is typically narrowed by focusing on candidate mod- 
els that appear promising via some measure of correctness 
(agreement with ground truth labels, for supervised classification),
such as Cohen’s kappa or F1 [16, 40]. One common
methodological step that involves model selection (narrow- 
ing the search space) is wrapper forward feature selection 
[29], a process wherein features are added one at a time to 
a model based on which feature produces the largest gain 
in model correctness. Changing the correctness metric by 
which features are evaluated can have a significant impact 
on the final selected model (which we demonstrate in this pa- 
per); however, little is known about exactly what these im- 
pacts are for different correctness metrics. In this paper, we 
address this problem by performing feature selection based 
on different metrics and comparing the resulting models. 


Previous work in the area of examining correctness met- 
rics for educational data mining has largely focused on what 
those metrics reveal about models [40, 10]. Related work has 
shown, for example, that area under the receiver operating 
characteristic curve (AUC or AUROC) ignores the scale of 
model predictions [40], and that F1 can be increased by
over-predicting the positive class [10]. From such findings we can
generate hypotheses about the properties of models that re- 
sult from relying on those metrics during feature selection. 
For example, we might expect recall- and F1-based feature
selection to favor models that over-predict the positive class. 
However, there is little empirical evidence to support such 
hypotheses, which we aim to provide in this paper. 


We explore a wide variety of correctness metrics for feature 
selection, evaluating them on 11 education-related datasets, 
to empirically measure relationships between feature selec- 
tion metrics and resulting models. We include well-known 
and extensively-used metrics like AUC, Cohen’s kappa, and 
others, as well as metrics that are less-commonly used but 
perhaps equally valuable, like the Matthews correlation coef- 
ficient and the minimum proper AUC. We experiment with 
metrics and datasets across four commonly-used machine 
learning classifiers, including support vector machine, naive 
Bayes, logistic regression and random forest. These algo- 
rithms have been frequently applied with great success in 
educational data mining and related research [24, 21, 43, 9], 
including in situations where high-dimensional data require 
feature selection [27, 49, 34].


To the best of our knowledge, ours is the first work to explicitly
test differences between correctness metrics in the
context of feature selection. Our results are valuable for fu- 
ture educational data mining research and practice by pro- 
viding guidance to machine learning experts who wish to 
make evidence-based decisions about their model building 
methods. In particular, we characterize metrics in terms 
of the models that result from performing feature selection 
based on those metrics, which will help researchers decide 
on appropriate metrics based on the desired properties of 
their resulting models. 


2. RELATED WORK 


While previous research in this area is limited, a few
relevant projects produced findings that significantly informed
our current work.
In this section, we describe metrics evaluated in this study 
along with examples where they were used in previous work, 
then discuss directly-related work on evaluating metrics in 
educational data mining. 


2.1 Metrics and their Usage 

Accuracy. In this paper, accuracy refers to the proportion 
of correctly classified instances, though in other contexts 
it may refer more generally to any measure of how well a 
model’s predictions align with ground truth values. Accu- 
racy is one of the most straightforward metrics to calculate 
and understand, and thus has been reported frequently in 
machine learning studies [35, 12]. However, previous re- 
search has noted flaws with accuracy. In situations where 
labels are imbalanced, accuracy is often attenuated [25] or 
inflated [10] depending on the rate at which the model pre- 
dicts the majority class. Despite possible flaws, it is com- 
monly examined and is often the default correctness measure 
in machine learning software [39], including in wrapper fea- 
ture selection software [41], so we include it in this paper. 


AUC. AUC measures model correctness in terms of true 
positive rate across every possible false positive rate (i.e., 
across all possible decision thresholds). Chance level AUC 
is 0.5, while a perfect model has AUC = 1 and a completely 
incorrect model has AUC = 0. AUC is a valuable metric for 
its clear interpretability and effectiveness in the face of class 
imbalance [25], and has often been reported as an evaluation 
metric on educational datasets (e.g., [26, 23, 40, 37]). How- 
ever, it only measures correctness in terms of the order of 
predicted values, not their scale [40], so it is unclear whether 
selecting features based on AUC will result in models that 
may have poorly-scaled predictions (an issue we explore in 
this paper). A related metric is the area under the
precision-recall curve (AUPRC) [44], which also considers all possible
decision thresholds. We have not yet included AUPRC in 
analyses, but expect that its behavior with respect to scale 
of predictions may be similar to AUC. 
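
The point that AUC depends only on the ordering of predictions can be illustrated with a minimal sketch. It is not the code used for the experiments reported here: the pairwise AUC computation, the function names, and the data are all assumptions chosen for brevity. Dividing every prediction by a constant leaves AUC unchanged, while RMSE (which is sensitive to scale) gets worse.

```python
import math

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))

def rmse(labels, scores):
    """Root mean squared error between 0/1 labels and predicted probabilities."""
    return math.sqrt(sum((y - s) ** 2 for y, s in zip(labels, scores)) / len(labels))

# Hypothetical labels and predicted probabilities.
labels = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.7, 0.8, 0.9, 0.3, 0.6]
shrunk = [s / 10 for s in scores]  # same ordering, badly scaled

auc_original, auc_shrunk = auc(labels, scores), auc(labels, shrunk)
rmse_original, rmse_shrunk = rmse(labels, scores), rmse(labels, shrunk)
```

Here the two AUC values are identical because the ranking of instances is preserved, whereas the RMSE of the shrunken predictions is larger.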


MPAUC. In situations where models provide only binary 
predictions, an approximation of AUC can be calculated 
by measuring the minimum proper AUC (MPAUC) of the 
quadrilateral formed by the single available decision thresh- 
old [38], as shown in Figure 1. We refer to this metric 
as MPAUC for the sake of brevity when reporting results, 
though it is not typically abbreviated in previous literature. 
It differs from AUC in that it measures the area for a “curve”
defined by a single point instead of many points as in AUC.
Its advantage is that it is applicable even when continuous 
decision thresholds are not available. MPAUC has been uti- 
lized as a metric for feature selection in prior educational 
data mining research [9], but it is unclear how it compares 
to alternatives we explore in this paper. 
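
One reading of the single-threshold area can be sketched as follows. The sketch assumes the quadrilateral has vertices at the origin, the single ROC point (FPR, TPR), the top-right corner, and the bottom-right corner; the function names and confusion-matrix counts are hypothetical. Under that assumption the area reduces to (1 + TPR - FPR) / 2.

```python
def polygon_area(vertices):
    """Shoelace formula for the area of a simple polygon."""
    total = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        total += x1 * y2 - x2 * y1
    return abs(total) / 2

def single_point_auc(tp, fp, tn, fn):
    """Area under an ROC 'curve' defined by one decision threshold."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    # Assumed quadrilateral: origin, ROC point, top-right, bottom-right.
    return polygon_area([(0.0, 0.0), (fpr, tpr), (1.0, 1.0), (1.0, 0.0)])

# Hypothetical confusion-matrix counts: TPR = 0.8, FPR = 0.1.
area = single_point_auc(tp=40, fp=10, tn=90, fn=10)
chance = single_point_auc(tp=50, fp=50, tn=50, fn=50)
```

With these counts the area agrees with (1 + 0.8 - 0.1) / 2, and a classifier at the chance diagonal yields 0.5.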


Figure 1: Example MPAUC (shaded area). Axes: false positive rate and
true positive rate.


MCC. The Matthews Correlation Coefficient (MCC) mea- 
sures the correlation between two binary variables (predicted 
labels and actual labels) [30], and is equivalent to Pearson’s 
r for two binary variables (i.e., φ). MCC ranges from -1
to 1, where 0 indicates chance level and 1 indicates perfect
classification. MCC is especially useful in binary classification
models where there is class imbalance, since its chance
level is not affected by imbalance. It is
only defined for binary variables. While it is not common in
educational data mining research, it has been occasionally 
reported [8, 1] and is valued in other machine learning fields 
[15]. 
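
The equivalence between MCC and Pearson's r on binary variables can be checked numerically. The following is a minimal sketch with hypothetical labels; returning 0 when the denominator is zero is a convention assumed here, not something specified above.

```python
import math

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 is an assumed convention

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ground truth and binary predictions.
actual = [1, 1, 1, 0, 0, 0, 0, 0]
predicted = [1, 1, 0, 1, 0, 0, 0, 0]

tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

mcc_value = mcc(tp, fp, tn, fn)
pearson_value = pearson_r(actual, predicted)
```

Both computations give the same value for this example (7/15), as expected for two binary variables.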


Recall. Recall is the proportion of a certain label class 
(typically the positive class) that was correctly identified as 
being in that class [46, 4]. Recall is an informative measure 
for understanding model correctness, especially in situations 
where it is important to focus on one class (e.g., in situations 
where false negatives are costly). However, it can be inflated 
by over-predicting the positive class [10] and is thus not 
often reported as the sole measure of model correctness, so 
it is unclear whether it is appropriate as a metric for feature 
selection. 


Precision. Precision is similar to recall; it is the proportion 
of instances predicted as being in the positive class that were 
correct predictions. Like recall, it is typically only reported 
in conjunction with other correctness metrics, but unlike 
recall it cannot be inflated by over-predicting the positive 
class [10]. However, in some cases it can be maximized by 
predicting the positive class for only a few of the highest- 
confidence instances. 


F1. F1 is defined as the harmonic mean of precision and
recall, and thus avoids some of the issues of recall (favoring
over-prediction of the positive class) and precision (under-predicting
the positive class). However, it can be inflated
by over-predicting the positive class [10], so it is unclear
whether selecting features based on F1 will favor models
that over-predict the positive class or not. 
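
The inflation of recall and F1 through over-prediction can be seen in a small worked example. The sketch below uses hypothetical balanced data and two hypothetical classifiers: one that always predicts the positive class, and one that is correct 60% of the time within each class.

```python
def precision_recall_f1(actual, predicted):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical data: 50 positive and 50 negative instances.
actual = [1] * 50 + [0] * 50
always_positive = [1] * 100
sixty_percent = [1] * 30 + [0] * 20 + [1] * 20 + [0] * 30

_, recall_always, f1_always = precision_recall_f1(actual, always_positive)
_, recall_sixty, f1_sixty = precision_recall_f1(actual, sixty_percent)
```

The always-positive classifier reaches perfect recall and an F1 of about 0.67, exceeding the F1 of 0.6 obtained by the classifier that is actually correct more often.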


RMSE. RMSE (root-mean-square error) measures the Eu- 
clidean distance between predictions and ground truth la- 
bels. Since RMSE is an error metric, lower values are better, 
with 0 indicating no error. It is commonly associated with 
regression problems, since it can be easily calculated for con- 
tinuous labels, but is also effective for binary classification 
with models that produce continuous-valued probability pre- 
dictions [40, 13]. Previous research has noted that RMSE 
is especially effective for optimizing probabilistic predictions 
[40]; thus, we expect that selecting features based on RMSE 
might also produce models with well-calibrated probabili- 
ties (where model confidence matches the probability that 
the model is correct). Like AUC, RMSE does not require 
setting a decision threshold, unlike the other metrics we con- 
sider in this study. We refrained from using close variants 
like Mean Absolute Deviation (MAD) or Error (MAE), since 
previous work has noted issues with these metrics for model 
selection [40]. 


Kappa. Cohen’s kappa (κ) was developed as a measure
of agreement between human annotators [16], but has often
been utilized as a machine learning correctness metric
by measuring the agreement between ground truth labels 
and predicted labels [10]. Like correlation measures, kappa 
ranges from -1 to 1 where 0 is random chance and 1 indicates 
perfect classification. 
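
A minimal sketch of the kappa computation from confusion-matrix counts follows, using the standard definition (observed agreement corrected for the agreement expected by chance). The counts are hypothetical and describe a classifier that always predicts the majority class in a 90/10 imbalanced dataset.

```python
def cohen_kappa(tp, fp, tn, fn):
    """Cohen's kappa for binary predictions versus ground truth."""
    n = tp + fp + tn + fn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of predicted and actual labels.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / (n * n)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0

# Hypothetical: 10 positives, 90 negatives, every instance predicted negative.
accuracy = (0 + 90) / 100
kappa = cohen_kappa(tp=0, fp=0, tn=90, fn=10)
```

Accuracy is 0.9 for this classifier while kappa is 0, which illustrates why chance-corrected metrics behave differently from accuracy under class imbalance.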


2.2 Research on Metrics in Educational Data Mining

Previous research has focused on metrics primarily in terms
of the perspective that metrics have on a model of students,
or on the properties of the model that are highlighted (or
hidden) by particular metrics.


In one previous project, researchers focused on evaluating
the properties of metrics that require continuous (probability-like)
predictions [40]. In particular, they focused on AUC,
RMSE, mean absolute error (MAE), and log likelihood (LL). 
They noted that for some applications (e.g., prediction of 
probability that a student has mastered a specific skill) met- 
rics such as AUC do not favor well-calibrated models. They 
also compared metrics in terms of how often they agreed 
on picking the best model out of a pool of 20 simulated 
datasets, finding that RMSE and LL frequently agreed (17 
out of 20) but others agreed much less often; the second- 
highest agreement was between RMSE and AUC, on 7 out 
of 20 datasets. This is especially relevant to the work in 
this paper, where we compare properties of metrics applied 
across 11 real-world datasets. 


In similar previous work, researchers compared the proper- 
ties of metrics that require binary or categorical predictions, 
rather than continuous predictions [10]. They noted that F1 
is influenced by the base rate of the positive class in data, in 
line with other research on Cohen’s kappa, AUC, and other 
metrics [25]. However, they also noted that F1 (and recall) 
are influenced by the predicted rate of classifiers. This find- 
ing is especially relevant to the current research because it 
is possible that feature selection will favor models and features
that tend to predict more of the positive class when
selecting based on these metrics.


3. FRAMING THE PROBLEM 


The goal of this paper is, broadly speaking, to provide em- 
pirical results that illustrate the relationships 1) among dif- 
ferent metrics, and 2) between metrics and models, when 
metrics are employed for forward feature selection. 


Sequential feature selection is a type of wrapper (model- 
based) feature selection in which a feature is added to or re- 
moved from a model, the model is re-trained, and the quality 
of the feature in question is assessed based on improvement 
in model correctness (as measured by some metric). In this 
study, we specifically performed forward feature selection by 
adding one feature at a time, stopping when all features were 
added or when the model had not improved for three consec- 
utive features, then returned the set of features with maxi- 
mum correctness among all the combinations explored. Our 
work focuses primarily on the effects of utilizing different 
metrics for the step in which model correctness is assessed, 
which drives the entire feature selection process. We define 
four research questions (RQs) to explore this problem: 


RQ1: When selecting features based on a specific 
metric, how do the results vary in terms of the 
other metrics? Addressing this question will inform deci- 
sions about which metric to apply during feature selection 
by showing the relationships between metrics. For exam- 
ple, some low-cost applications may benefit from high recall 
(e.g., automatically selecting the most relevant material for 
students to review) while other higher-cost applications may 
require high precision (e.g., automatically predicting when a 
teacher should intervene to redirect learning behaviors). In 
these examples, we may wish to optimize feature selection 
for different metrics, but it is crucial to understand how that 
might influence other metrics; e.g., does optimizing feature 
selection for AUC tend to produce models that are also good 
in terms of Cohen’s kappa, recall, and the other metrics? 


To address RQ1 we define the ranking of a metric with re- 
spect to all the other metrics. Specifically, given a set of 
metrics M, a selection metric X ∈ M has rank 0 with respect
to another metric Y ∈ M if selecting features based
on X results in the best¹ value of Y compared to selecting
features based on all other metrics in M. Likewise, a metric
Z ∈ M has rank 1 with respect to Y if selecting features
based on Z produces the second-best value of Y compared
to all other metrics in M, and so on. Generally, we expect
that selecting features for some metric X ∈ M will
have rank 0 with respect to itself (X), though this is not 
necessarily always true. Furthermore, some metrics may be 
generally better than others in terms of rank, if they tend to 
favor models with well-rounded properties that satisfy each 
metric. We thus calculate the mean ranking of each metric 
as the mean of all rankings for a metric with respect to it- 
self and all other metrics (nine in total, in this paper), as a 
way to discover which feature selection metrics tend to yield 
models that satisfy the wide range of criteria imposed by 
different metrics. 
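
The rank and mean ranking definitions above can be sketched in code. This is an illustration under stated assumptions, not the analysis code behind the reported results: the nested dictionary layout, the function names, the tie handling (only strictly better values count), and all numbers are hypothetical.

```python
LOWER_IS_BETTER = {"RMSE"}  # error metrics, where smaller values are better

def rank(results, selector, target):
    """Number of other selection metrics that yield a strictly better value of `target`."""
    own = results[selector][target]
    others = [results[m][target] for m in results if m != selector]
    if target in LOWER_IS_BETTER:
        return sum(value < own for value in others)
    return sum(value > own for value in others)

def mean_ranking(results, selector):
    """Mean of a selection metric's ranks with respect to itself and all other metrics."""
    targets = list(results[selector])
    return sum(rank(results, selector, t) for t in targets) / len(targets)

# results[selection metric][evaluated metric] = value on held-out data (hypothetical).
results = {
    "MCC":       {"MCC": 0.40, "Precision": 0.60, "RMSE": 0.45},
    "Precision": {"MCC": 0.25, "Precision": 0.70, "RMSE": 0.50},
    "RMSE":      {"MCC": 0.38, "Precision": 0.62, "RMSE": 0.42},
}

mean_ranks = {m: mean_ranking(results, m) for m in results}
```

In this toy table each selection metric has rank 0 with respect to itself, and the mean rankings are 1.0 for MCC, 4/3 for precision, and 2/3 for RMSE.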


¹“Best” meaning highest for most metrics, but lowest for
RMSE since it is an error metric.


RQ2: How do different feature selection metrics impact
model calibration? As previous work noted, some
metrics do not penalize models for being poorly calibrated 
[40]. However, it remains unclear how large an effect using
different metrics during feature selection may have on the
calibration of the resulting model. We address this research 
question by calculating CAL scores (described in Sec. 4.4) 
for models selected based on each metric [12]. 


RQ3: How do different feature selection metrics im- 
pact the predicted rates of models? Certain correctness 
metrics favor over- or under-prediction of the positive class 
more than others. For example, accuracy for a problem with 
imbalanced classes can be increased simply by biasing pre- 
dictions of the positive class in the same direction as the 
imbalance in the data [10]. We might expect that relying 
on accuracy for feature selection could thus result in models 
that over- or under-predict the positive class, but it is unclear
how problematic these effects may be, which we measure in 
addressing this research question. 


RQ4: Do some feature selection metrics tend to re- 
sult in more parsimonious models (fewer features) 
than others? In addressing this research question, we fur- 
ther characterize the models that result from applying dif- 
ferent metrics during feature selection, and highlight cases 
where feature selection may fail (by selecting too few fea- 
tures) or unnecessarily increase model complexity (by se- 
lecting an unusually large number of features). 
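
The forward feature selection procedure described at the start of this section (add one feature at a time, stop after three consecutive additions without improvement, return the best subset seen) can be sketched as follows. This is a simplified illustration rather than the code used for the experiments: `score` stands in for training a model and computing the chosen metric, and the toy scoring function and its gains are hypothetical.

```python
def forward_select(features, score, patience=3):
    """Greedy forward selection; `score(subset)` must return a higher-is-better value."""
    selected, remaining = [], list(features)
    best_subset, best_score = [], float("-inf")
    since_improvement = 0
    while remaining and since_improvement < patience:
        # Add the remaining feature whose inclusion gives the highest score.
        candidate = max(remaining, key=lambda f: score(selected + [f]))
        selected = selected + [candidate]
        remaining.remove(candidate)
        current = score(selected)
        if current > best_score:
            best_subset, best_score = list(selected), current
            since_improvement = 0
        else:
            since_improvement += 1
    return best_subset, best_score

# Hypothetical scoring: each feature adds a fixed gain, minus a small cost per feature.
GAINS = {"a": 0.30, "b": 0.20, "c": 0.00, "d": 0.00, "e": 0.00, "f": 0.25}

def toy_score(subset):
    return sum(GAINS[f] for f in subset) - 0.01 * len(subset)

chosen, chosen_score = forward_select(list(GAINS), toy_score)
```

For an error metric such as RMSE, the scoring function would return the negated error so that higher remains better. In the toy example the three informative features are selected, and the search stops after three uninformative additions fail to improve the score.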


4. EXPERIMENTS 


We performed a variety of experiments to address our re- 
search questions, consisting of training and testing machine 
learning classifiers with forward feature selection. Experiments
required approximately 11 months of continuous run
time², given that we performed extensive hyperparameter
selection with 4 classifiers, 11 datasets, and 9 feature selec- 
tion metrics, as detailed in this section. 


4.1 Classifiers 


As mentioned in the Introduction, we trained models includ- 
ing random forest, support vector machines, naive Bayes, 
and logistic regression. These machine learning algorithms 
represent a variety of methods with differing assumptions 
and levels of flexibility, and which are frequently employed 
in educational data mining research [18, 5, 21, 43, 20, 7, 11, 
45]. Moreover, with the possible exception of random for- 
est, these models quite often benefit from feature selection 
to avoid problems of over-fitting (e.g., when a logistic regres- 
sion has nearly as many parameters as instances) [33] and 
collinearity (e.g., when two very similar features incorrectly 
double the impact of a relationship in a naive Bayes model). 


4.2 Cross-validation 

We utilized student-level four-fold cross-validation, training 
each model on data from 75% of students and testing it on 
the remaining 25% of students, then repeating a total of four 
times until each student was in the testing data exactly once. 
This procedure ensured that data from the same student was
never present in training and testing at the same time, which
was crucial given that some of our datasets had multiple
instances per student.


²Experiments were run on an Intel Core i7 4.2 GHz processor
(using a single core) with 32 GB memory and 256 GB
storage.


We performed nested (within training data) student-level 
four-fold cross-validation for evaluating hyperparameters and 
selecting features. Specifically, for every possible combina- 
tion of hyperparameters, we performed forward feature se- 
lection, then stored the best result from the feature selection 
process (according to the current selection metric). Finally, 
we retrained the model using the best set of hyperparame- 
ters, including the best features, on all training data, and 
applied it to the testing data. Hyperparameter selection and 
feature selection did not involve the testing set in any way. 
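
Student-level fold assignment can be sketched as below. The round-robin assignment of students to folds is an assumption made to keep the example deterministic (the text above does not specify how students were assigned), and the function name and student identifiers are hypothetical. The same splitter could be applied again to the training indices to obtain the nested folds.

```python
def student_level_folds(student_ids, n_folds=4):
    """Yield (train_idx, test_idx) pairs in which no student spans both sides."""
    students = sorted(set(student_ids))
    for k in range(n_folds):
        # Assumed round-robin assignment of students to folds.
        test_students = set(students[k::n_folds])
        test_idx = [i for i, s in enumerate(student_ids) if s in test_students]
        train_idx = [i for i, s in enumerate(student_ids) if s not in test_students]
        yield train_idx, test_idx

# Hypothetical instances, several per student.
student_ids = ["s1", "s1", "s2", "s3", "s3", "s3",
               "s4", "s5", "s5", "s6", "s7", "s8"]
folds = list(student_level_folds(student_ids))
```

Every instance appears in exactly one test fold, and all instances from a given student fall on the same side of each split.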


There are two common strategies for evaluating the results 
of cross-validation. The first, macro-level averaging, con- 
sists of calculating the desired correctness metric for each 
fold and averaging across folds (four folds, in our case). The 
second strategy, micro-level averaging, involves storing the 
predictions of each fold and calculating the correctness met- 
ric once at the end based on all predictions. We evaluated 
both strategies to assess possible differences on the feature 
selection process. 
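
The difference between the two averaging strategies can be shown with a small example. The sketch below uses F1 and two hypothetical folds (four folds would work the same way); the fold contents are invented to make the difference visible.

```python
def f1_score(actual, predicted):
    """F1 for the positive class, computed as 2TP / (2TP + FP + FN)."""
    tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
    fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
    fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))
    return 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0

# Each fold is (actual labels, predicted labels); values are hypothetical.
folds = [
    ([1, 1, 0, 0], [1, 1, 0, 0]),  # every prediction correct
    ([1, 0, 0, 0], [0, 1, 0, 0]),  # no true positives
]

# Macro-level: compute the metric per fold, then average across folds.
macro = sum(f1_score(a, p) for a, p in folds) / len(folds)

# Micro-level: pool all predictions, then compute the metric once.
pooled_actual = [a for fold_actual, _ in folds for a in fold_actual]
pooled_predicted = [p for _, fold_predicted in folds for p in fold_predicted]
micro = f1_score(pooled_actual, pooled_predicted)
```

Macro-level averaging gives 0.5 here, while micro-level averaging gives about 0.67, because the pooled computation weights folds by their counts of positive predictions and labels.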


4.3 Hyperparameters

We extensively tested common hyperparameters for each 
classification algorithm to ensure models had a chance to 
fit to the very different properties of our datasets (e.g., type 
of data, number of features, size of dataset). 


For random forest we set the number of trees at 50 (signif- 
icantly increasing this proved infeasible for an already-long 
run time). We varied the minimum number of samples re- 
quired to create a branch in each tree, trying 5 different val- 
ues (2, 4, 8, 16, or 32). This hyperparameter controls model 
complexity by restricting how fine-grained the decisions in 
each tree can be. We also varied the number of features 
randomly chosen for building each tree, testing 4 options in- 
cluding proportions of .25, .50, .75, and the square root of 
the number of features (the default setting). This hyperpa- 
rameter controls how different trees are from each other in 
terms of the features from which they are trained. In total, 
there were 5 x 4 = 20 combinations of hyperparameters for 
random forest. 
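
The random forest grid described above can be enumerated as a short sketch. The dictionary keys are hypothetical names and are not tied to any particular library's parameter names.

```python
from itertools import product

# Values taken from the description above; key names are illustrative.
min_samples_to_branch = [2, 4, 8, 16, 32]
features_per_tree = [0.25, 0.50, 0.75, "sqrt"]

rf_grid = [
    {"n_trees": 50, "min_samples_to_branch": m, "features_per_tree": f}
    for m, f in product(min_samples_to_branch, features_per_tree)
]
```

The resulting list has 5 x 4 = 20 entries, one per hyperparameter combination.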


We trained SVMs with the radial basis function (RBF) kernel,
which has a hyperparameter γ that controls the size
(radius of influence) of each RBF kernel. We tried values
for γ of 0.001, 0.01, 0.1, 1, and 10. Similarly, we tuned C,
the SVM complexity hyperparameter, over the same set of 5
possible values. There were thus 5 x 5 = 25 hyperparameter
combinations for SVM.


Naive Bayes has little in the way of hyperparameters to tune, 
apart from the distribution assumption to use. We assumed 
a Gaussian distribution for all models, and thus did not 
perform grid search across hyperparameters. 


We trained logistic regression models with L2 regularization, 
and tuned the strength of regularization as a hyperparam- 
eter over the space of 5 possible values: 0.001, 0.01, 0.1, 1, 
and 10.


Finally, we experimented briefly with hyperparameters related
to class imbalance in the datasets, after noting that
models frequently learned to only predict the majority class. 
We initially experimented with re-weighting instances of the 
minority class with higher weight set as a hyperparameter, 
but ultimately found that generating synthetic minority- 
class data via SMOTE (Synthetic Minority Over-sampling 
TEchnique [14]) was more consistently effective across our 
datasets without requiring hyperparameter tuning. 


4.4 Measuring Model Calibration 

Calibration refers to how well a model’s predicted probabili- 
ties match the probability that those predictions are correct. 
For example, given a set of 100 instances where model predictions
are all ≈ 0.7, we would expect 70 of the instances
to be the positive class, and 30 to be in the negative class. If 
more than 70 are true positives, the model is under-confident 
for those 100 instances, while if fewer are true positives, the 
model is overconfident. Good model calibration is desirable 
so that predictions are interpretable as probabilities, allowing
decision thresholds to be set in meaningful ways (e.g.,
triggering an intervention only if the model is at least 90%
confident, knowing that at most roughly 10% of the triggered
interventions will then be false positives).


We measured calibration by calculating CAL scores [12]. 
The CAL score for a model is calculated by sorting all N 
instances according to predicted probability, then dividing 
into N - 99 sliding windows of 100 instances (sliding by 1 instance).
For each window, we calculated the absolute difference
between the base rate of the positive class for those 100
instances and the mean predicted probability for the same 
instances. The CAL score consists of the mean of those ab- 
solute differences across all windows, and can be interpreted 
as the mean absolute error in model confidence. 
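
The CAL computation described above can be sketched as follows. This is an illustrative reading of the description, not the reference implementation from [12]; the function name, the error raised for short inputs, and the example data are assumptions.

```python
def cal_score(labels, probs, window=100):
    """Mean absolute gap between observed positive rate and mean prediction per window."""
    # Sort instances by predicted probability.
    pairs = sorted(zip(probs, labels), key=lambda pair: pair[0])
    n = len(pairs)
    if n < window:
        raise ValueError("need at least `window` instances")
    gaps = []
    # N - window + 1 windows; with window = 100 this is N - 99.
    for start in range(n - window + 1):
        chunk = pairs[start:start + window]
        mean_prob = sum(p for p, _ in chunk) / window
        base_rate = sum(y for _, y in chunk) / window
        gaps.append(abs(base_rate - mean_prob))
    return sum(gaps) / len(gaps)

# Hypothetical data: 100 instances, half of them positive.
labels = [1] * 50 + [0] * 50
calibrated = cal_score(labels, [0.5] * 100)
overconfident = cal_score(labels, [0.9] * 100)
```

Predicting 0.5 everywhere matches the base rate and gives a CAL score of 0, while predicting 0.9 everywhere gives a score of about 0.4.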


4.5 Datasets 

4.5.1 Video-based Engagement Detection Datasets

We obtained six datasets from a study that measured students’
self-reported engagement during an essay writing task
[31], during which students’ faces were recorded by a video 
camera. Students made verbal judgments of their engage- 
ment in the moment (concurrently) in response to auditory 
probes. One week later, they made retrospective judgments 
of their engagement by viewing video clips of themselves that 
were recorded during the essay writing task. There were 23 
students who made a total of 530 judgments of engagement 
during the writing task and 1,325 retrospective judgments. 
Researchers extracted three sets of features from videos: 1) 
heart rate, estimated via photoplethysmography [32]; 2) an- 
imation units (ANUs), a set of facial feature descriptors pro- 
vided by the Microsoft Kinect SDK, which are analogous to 
facial action units (AUs) [19]; and 3) local binary patterns 
in three orthogonal planes (LBP-TOP) [50], which capture 
facial textures and how those textures change over time. 


There were thus two sets of labels and three sets of features, 
for a total of six video-related datasets. We refer to the 
two heart rate datasets as VIDEO-HR-C (concurrent labels) 
and VIDEO-HR-R (retrospective labels). Similarly, we refer to 
the two animation unit datasets as VIDEO-ANU-C and
VIDEO-ANU-R, and the two LBP-TOP datasets as VIDEO-LBP-C and
VIDEO-LBP-R. 


4.5.2 Cognitive Tutor Algebra Datasets 

We obtained two datasets from a study [36] in which 59 stu- 
dents interacted with a computerized learning environment 
called Cognitive Tutor Algebra [3]. Students used Cogni- 
tive Tutor Algebra for an entire year as part of their reg- 
ular mathematics curriculum. Researchers labeled 10,397 
sequences of student actions in the learning environment for 
the presence of “gaming the system” behavior, where stu- 
dents attempt to progress through material by exploiting 
features of the learning environment (e.g., requesting hints 
repeatedly, guessing many answers) [6]. 


Researchers extracted two sets of features. Pattern features 
captured the presence or absence of 60 different sequences of 
actions that were designed to be similar to patterns identi- 
fied by domain experts. We refer to the dataset with pattern 
features as CTA-PF in this paper. The second set of features 
consisted of 25 count features. Count features captured the 
number of times 6 different actions occurred as well as the 
number of times 19 different events occurred. Events were 
identified by domain experts, and included things like paus- 
ing between attempts to answer a problem or trying to reuse 
an answer in multiple steps of a problem. We refer to the 
dataset with 25 count features as CTA-C in this paper. 


4.5.3 Student Survey Datasets 

Two additional datasets came from surveys obtained from 
788 students at two different secondary schools during the 
2005-2006 school year [17]. The survey consisted of 30 ques- 
tions, including demographics, which school they attended 
(of two possibilities), and other variables. We one-hot en- 
coded variables with categorical answers. Labels in both 
datasets consisted of course grades recorded on a 0-20 scale. 
We converted these to binary labels by splitting on the me- 
dian into high and low grades, so that all datasets would be 
comparable binary classification problems. 


One of the datasets came from students in a mathematics 
course (MATH, with 395 students) and the other from a Por- 
tuguese language class (PORTUGUESE, with 649 students). 
Some students were in both classes; thus, the total number 
of students was less than the sum of the classes. 


4.5.4 Educational Process Mining Dataset 

We also extracted features from an educational process min- 
ing (EPM) dataset. Students worked on electronics exercises 
in a software environment called DEEDS (Digital Electron- 
ics Education and Design Suite). Students’ actions in the 
learning environment were timestamped and logged, and in- 
cluded mouse movements, keystrokes, and information about 
the exercises being solved. Grade data were provided for five 
learning sessions, from which we extracted features including 
time spent on activities, number of actions, mean, standard 
deviation, and other summary features from problem-level 
data. In total, 115 students participated, but grades and 
action log data were not available for all students in every 
session. Grades were recorded on a numeric scale, though we 
again converted these to classification problems via median 
split to maintain consistency with other datasets. 


5. RESULTS AND DISCUSSION 


We focus results on the four research questions outlined in 
Section 3; we also provide model correctness results in the
Appendix, but do not focus on these results here since the
goal of this work is to compare metrics rather than focus 
on improving over previously-published models. Our exper- 
iments to address the research questions included 4 different 
machine learning algorithms, 2 methods of calculating re- 
sults during cross-validation, and 11 datasets. The different 
machine learning algorithms yielded similar patterns for our 
primary research question (RQ1), with only a few exceptions 
(Figure 2). Similarly, results differed little across macro- and 
micro-averaging methods (Figure 3). Thus, we aggregated 
across classification algorithms and averaging methods to
address our research questions without unnecessarily dividing
results into 8 (2 averaging levels x 4 classifiers) subsets.


Figure 2: Mean ranking for each machine learning
algorithm and feature selection metric. “Log. Reg.”
refers to logistic regression.


Figure 3: Mean ranking for feature selection metrics
when calculating results via macro-level versus
micro-level averaging.


5.1 Mean Rankings 

RQ1 asks When selecting features based on a specific metric, 
how do the results vary in terms of the other metrics? Re- 
sults in Table 1 show that MCC was, on average, the best 
(lowest) across models and datasets. Mean ranking for MCC 
averaged 3.441 across all datasets, while AUC and MPAUC 
were similar, with mean rankings of 3.468 and 3.476, respectively.
The low rank for MCC indicates that, across the 11 datasets,
selecting features based on improvement in MCC yielded
better results (in terms of itself and the other 8 metrics)
than selecting features based on any of the other metrics.
Specifically, there were 3.441 correctness metrics on average
for which selecting features based on some metric other than 
MCC yielded better results than MCC. 
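
The mean ranking described above can be sketched in a few lines. This is a minimal illustration under one plausible reading of the procedure, not the implementation used in the experiments: the rank of a selection metric under a given evaluation metric is assumed to be the number of competing selection metrics that scored strictly better. The function name, the three-metric subset, and all scores are hypothetical.

```python
# Hypothetical scores, laid out as scores[selection_metric][evaluation_metric].
# Every number here is invented for illustration; none comes from the paper.
LOWER_IS_BETTER = {"rmse"}  # RMSE is an error, so smaller values are better


def mean_rankings(scores):
    """Return the mean rank of each selection metric across evaluation metrics.

    Assumption: a rank is the count of other selection metrics that scored
    strictly better on the same evaluation metric (0 means best).
    """
    selectors = list(scores)
    evaluators = list(next(iter(scores.values())))
    result = {}
    for selector in selectors:
        ranks = []
        for evaluator in evaluators:
            own = scores[selector][evaluator]
            if evaluator in LOWER_IS_BETTER:
                better = sum(1 for o in selectors if scores[o][evaluator] < own)
            else:
                better = sum(1 for o in selectors if scores[o][evaluator] > own)
            ranks.append(better)
        result[selector] = sum(ranks) / len(ranks)
    return result


example = {
    "mcc":       {"mcc": 0.40, "precision": 0.60, "rmse": 0.45},
    "precision": {"mcc": 0.20, "precision": 0.70, "rmse": 0.55},
    "rmse":      {"mcc": 0.35, "precision": 0.55, "rmse": 0.40},
}
rankings = mean_rankings(example)
```

In this toy example each selection metric is best on itself, yet the metric with the lowest mean rank is the one that also does reasonably well on the others, which is the property the mean ranking is meant to capture.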


Conversely, precision was the worst-performing metric in 
terms of producing good results for other correctness met- 
rics, with a mean ranking of 5.672. Recall and accuracy both 
had mean rankings above 4, while all other selection metrics 
had rankings of approximately 3.5.


There was also some notable variation across datasets. Pos- 
sible causes of variations include the differing types of fea- 
tures in the datasets (binary, continuous, counts, etc.), class 
imbalance, and problem difficulty (e.g., signal to noise ra- 
tio). A handful of datasets had significantly lower mean 
rank values for a specific metric when compared to other metrics
and to the average value across all datasets for the metric
itself. For example, in the PORTUGUESE dataset, AUC was 
a particularly effective metric. AUC’s mean ranking was 
1.764, indicating that selecting features based on AUC in 
that dataset was almost always better (in terms of itself and 
the other metrics) than optimizing for those metrics was. 
In other datasets like VIDEO-LBP-C, the best metric had a 
much higher mean ranking. Similarly, metrics like F1 and
accuracy had unusually low mean rank values for the MATH
and VIDEO-HR-R datasets, respectively. In such cases, one 
metric did not frequently outperform the others. 


We also explored RQ1 visually by counting the number of 
datasets for which each metric had at least a certain rank- 
ing or better (Figure 4), much like constructing a receiver 
operating characteristic curve requires finding predictions 
above every possible threshold. In Figure 4, higher curves 
are better, indicating that there were more datasets where 
the metric had a desirable ranking. The curve for precision 
was clearly lowest, followed by recall and then accuracy. The
rest of the metrics were similar to one another, though the 
consistency of MCC is apparent from the fact that it was the 
first metric to achieve a certain ranking across all datasets. 
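
The counting behind the step graph can be reproduced directly from the per-dataset values in Table 1. The sketch below is illustrative only: it covers two of the nine metrics, the function names and the threshold grid are hypothetical, and it assumes that "a certain ranking or better" means a mean rank at or below the threshold.

```python
# Per-dataset mean rankings copied from Table 1, in the table's dataset order.
TABLE1 = {
    "MCC": [3.167, 3.986, 4.194, 3.556, 3.319, 2.931,
            3.583, 4.056, 3.125, 2.222, 3.708],
    "Precision": [4.778, 6.069, 6.556, 5.153, 5.875, 4.764,
                  5.319, 4.597, 6.056, 6.472, 6.750],
}


def datasets_at_or_below(ranks, threshold):
    """Count datasets whose mean rank is at or below the threshold."""
    return sum(1 for r in ranks if r <= threshold)


def step_curve(ranks, thresholds):
    """Return one count per threshold, i.e. the heights of the step graph."""
    return [datasets_at_or_below(ranks, t) for t in thresholds]


thresholds = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0]  # hypothetical grid of rank cutoffs
mcc_curve = step_curve(TABLE1["MCC"], thresholds)
precision_curve = step_curve(TABLE1["Precision"], thresholds)
```

With these values the MCC curve reaches all 11 datasets by a threshold of 5, whereas the precision curve does not do so until 7, which matches the ordering of the curves described in the text.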


5.2 Probability Calibration 

RQ2 asks How do different feature selection metrics impact 
model calibration? The features that are selected can in- 
fluence how well it is theoretically possible to calibrate a 
model. For example, a model with two binary features can 
only output four possible values, and thus it is quite likely 
the model will be unable to output predicted probabilities 
that closely align with the true probability that the model’s 
prediction is correct or not. 
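
The two-binary-feature example can be made concrete with a small sketch. The logistic form, the weights, and the bias below are hypothetical choices made only for illustration; the point holds for any deterministic model, since two binary inputs admit only four distinct input combinations.

```python
import itertools
import math


def predict_proba(x1, x2, w1=1.2, w2=-0.7, bias=0.1):
    """Logistic model over two binary features (weights are hypothetical)."""
    z = bias + w1 * x1 + w2 * x2
    return 1.0 / (1.0 + math.exp(-z))


# Enumerate every possible input: (0, 0), (0, 1), (1, 0), (1, 1).
outputs = {
    (x1, x2): predict_proba(x1, x2)
    for x1, x2 in itertools.product([0, 1], repeat=2)
}

# However many instances are scored, only these probabilities can ever appear.
distinct_probabilities = sorted(set(round(p, 6) for p in outputs.values()))
```

Because the model can emit at most four probabilities, it cannot match a true probability of correctness that varies more finely across instances.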


Results show that RMSE easily produced the best results 
(Table 2), with a mean calibration score (CAL) of 0.166 and 
the best CAL score in 8 of the 11 datasets. Recall had the 
worst calibration score averaged across models and datasets, 
followed by precision, accuracy, and F1.


5.3 Positive Class Predicted Rate 


RQ3 asks How do different feature selection metrics impact 
the predicted rates of models? The predicted rate of mod- 
els is in some respects related to model calibration, since 

Table 1: Mean ranking for each metric and dataset. Lower is better, indicating that a metric, on average,
yielded better results in terms of itself and the other metrics. Values range from 0 (selecting features for that 
metric always produced the best score in terms of itself and the other metrics) to 9 (the number of metrics). 
The best metric for each dataset is highlighted in green, while the worst is in red. 


Dataset Accuracy AUC F1 Kappa MCC MPAUC Precision RMSE Recall 
CTA-C 6.319 2.653 2.153 2.681 3.167 4.181 4.778 3.528 6.542 
CTA-PF 5.403 2.222 4.903 4.625 3.986 2.208 6.069 3.736 2.847 
VIDEO-ANU-C 2.958 4.069 3.583 3.403 4.194 3.806 6.556 5.250 2.181 
VIDEO-HR-C 3.847 5.111 4.514 3.764 3.556 4.181 5.153 2.333 3.542 
VIDEO-LBP-C 3.806 3.389 3.986 3.528 3.319 3.986 5.875 4.097 4.014 
VIDEO-ANU-R 4.931 3.306 3.431 5.139 2.931 3.264 4.764 3.542 4.694 
VIDEO-HR-R 2.000 4.389 4.111 4.333 3.583 4.722 5.319 3.306 4.236 
VIDEO-LBP-R 3.833 2.361 6.458 2.528 4.056 3.069 4.597 3.306 5.792 
EPM 3.222 5.319 3.208 2.458 3.125 2.694 6.056 3.556 6.361 
MATH 5.556 3.569 1.583 3.333 2.222 2.653 6.472 4.056 6.556 
PORTUGUESE 4.819 1.764 3.028 3.792 3.708 3.472 6.750 2.125 6.542 
Mean 4.245 3.468 3.723 3.598 3.441 3.476 5.672 3.530 4.846 
Std. dev. 1.282 1.172 1.327 0.862 0.573 0.776 0.781 0.843 1.605 


Table 2: Mean calibration score of each metric for each dataset. Lower is better, where 0 indicates that 
predicted probabilities exactly matched the probability that that model’s predictions were correct. The best 
metric for each dataset is highlighted in green, while the worst is in red. 


Dataset Accuracy AUC F1 Kappa MCC MPAUC Precision RMSE Recall 
CTA-C 0.387 0.149 0.164 0.155 0.204 0.225 0.228 0.106 0.408 
CTA-PF 0.337 0.282 0.269 0.268 0.269 0.260 0.403 0.252 0.261 
VIDEO-ANU-C 0.271 0.281 0.281 0.263 0.278 0.261 0.320 0.257 0.271 
VIDEO-HR-C 0.199 0.223 0.215 0.210 0.207 0.217 0.237 0.171 0.213 
VIDEO-LBP-C 0.284 0.257 0.272 0.241 0.248 0.255 0.308 0.232 0.280 
VIDEO-ANU-R 0.235 0.217 0.233 0.228 0.220 0.223 0.235 0.214 0.257
VIDEO-HR-R 0.199 0.194 0.209 0.199 0.198 0.205 0.217 0.173 0.202 
VIDEO-LBP-R 0.211 0.193 0.239 0.201 0.219 0.214 0.249 0.199 0.249 
EPM 0.067 0.147 0.069 0.060 0.066 0.071 0.099 0.063 0.184 
MATH 0.086 0.182 0.138 0.141 0.137 0.130 0.083 0.091 0.137 
PORTUGUESE 0.072 0.114 0.181 0.1138 0.140 0.117 0.133 0.066 0.207 
Mean 0.213 0.199 0.202 0.189 0.199 0.198 0.228 0.166 0.243 
Std. dev. 0.106 0.059 0.068 0.065 0.063 0.063 0.096 0.073 0.070 

Figure 4: Step graph for mean rankings of metrics
used for wrapper feature selection across all
datasets. The left edge of the x axis indicates the
best (lowest) ranking, while the right indicates the
worst (highest). The y axis indicates the number of
datasets that have mean rank < x.


a model that severely over- or under-predicts the positive 
class is unlikely to be well-calibrated (e.g., a model that 
always predicts 100% confidence for the positive class will 
have very poor calibration for any negative-class instances). 
Results reflect this relationship between calibration and predicted rate
(Table 3), showing that selecting features based on recall 
resulted in the largest mean absolute difference between ac- 
tual base rate and predicted rate (0.233), while RMSE was 
close to best (0.080). Selecting features based on accuracy 
(proportion correct) did not produce inaccurate predicted 
rates (mean absolute difference = 0.079), however, despite 
relatively poor model calibration. 
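
The mean absolute differences quoted above can be recomputed from the rows of Table 3. The sketch below copies the base rates and three of the predicted-rate columns from that table; the variable and function names are hypothetical, and the result is only expected to agree with the reported values to within rounding of the tabulated inputs.

```python
# Base rates and predicted rates copied from Table 3, in the table's row order.
BASE_RATE = [0.068, 0.068, 0.776, 0.776, 0.776, 0.733,
             0.733, 0.733, 0.237, 0.410, 0.425]
PREDICTED_RATE = {
    "Accuracy": [0.060, 0.029, 0.612, 0.669, 0.631, 0.610,
                 0.718, 0.590, 0.312, 0.373, 0.437],
    "RMSE": [0.111, 0.064, 0.533, 0.753, 0.588, 0.657,
             0.732, 0.629, 0.315, 0.484, 0.473],
    "Recall": [0.695, 0.085, 0.557, 0.663, 0.607, 0.568,
               0.702, 0.590, 0.585, 0.730, 0.840],
}


def mean_abs_difference(predicted, base):
    """Mean of |predicted rate - base rate| across datasets."""
    return sum(abs(p - b) for p, b in zip(predicted, base)) / len(base)


mad = {
    metric: mean_abs_difference(rates, BASE_RATE)
    for metric, rates in PREDICTED_RATE.items()
}
```

Recomputing from the table gives roughly 0.079 for accuracy, 0.080 for RMSE, and 0.233 for recall, in line with the last row of Table 3.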


For imbalanced datasets where classification is imperfect, ac- 
curacy can be inflated by over-predicting the majority class 
[10, 25]. However, Table 3 shows that selecting features 
based on accuracy did not have this effect, perhaps because 
we applied SMOTE to reduce the impact of class imbal- 
ance during training. Conversely, selecting features based 
on recall increased the positive class predicted rate for most 
datasets, since doing so can inflate recall regardless of the 
presence of class imbalance [10]. Similarly, selecting features 
based on precision often resulted in under-prediction of the 
positive class (10 out of 11 datasets). 


5.4 Number of Features Selected 

Selecting features based on precision yielded the fewest
features on average (4.173), while selecting based on RMSE
yielded the most (10.523; Table 4). Selecting features based on AUC
also yielded more features (10.006, on average) than other
metrics except RMSE.


These patterns are likely due to the fact that adding relatively
unimportant features to a model will offer only marginal
improvement, and may not be enough to shift predictions
above or below the decision threshold. All of the metrics
that require a decision threshold (accuracy, F1, kappa,
MCC, MPAUC, precision, and recall) resulted in fewer fea- 
tures than the threshold-free metrics of AUC and RMSE. 
For example, adding a feature that applies to only a few 
instances may help push the probability decision for those 
few instances in the right direction, but may not change 
the binary decision for those instances and thus may not be 
selected when evaluating based on threshold-based metrics. 
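
A toy example illustrates why a threshold-free metric can reward a feature that a threshold-based metric ignores. The labels and probabilities below are invented, and the 0.5 decision threshold is an assumption; the sketch compares accuracy and RMSE before and after a hypothetical feature nudges some predictions in the right direction without moving any of them across the threshold.

```python
import math


def accuracy(probs, labels, threshold=0.5):
    """Proportion of instances whose thresholded prediction matches the label."""
    hits = sum(1 for p, y in zip(probs, labels) if (p >= threshold) == (y == 1))
    return hits / len(labels)


def rmse(probs, labels):
    """Root mean squared error between predicted probabilities and labels."""
    return math.sqrt(sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels))


labels = [1, 0, 1, 0]

# Hypothetical predictions before and after adding a weak feature. Three of
# the four probabilities move toward the true label, but none crosses 0.5.
before = [0.60, 0.40, 0.45, 0.30]
after = [0.70, 0.40, 0.48, 0.20]

accuracy_before, accuracy_after = accuracy(before, labels), accuracy(after, labels)
rmse_before, rmse_after = rmse(before, labels), rmse(after, labels)
```

Accuracy is unchanged at 0.75, so a wrapper that selects on accuracy would see no reason to keep the feature, while RMSE decreases and a wrapper that selects on RMSE would add it.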


6. LIMITATIONS AND FUTURE WORK 


There are a few limitations to the experiments in this pa- 
per. First, the datasets that we analyzed represent only a 
handful from among thousands of educational datasets that 
researchers and others have collected over the years. Our 
datasets are also quite diverse, measuring very different stu- 
dent characteristics. Thus, we have only a sparse sampling 
of the space of educational datasets, and datasets that vary 
notably from those reported on here could exhibit differ- 
ent trends. Future work is especially needed in this area to 
discover specific properties of datasets (e.g., number of fea- 
tures, type of features) that inform which metrics are likely 
to be successful for wrapper feature selection. Such analysis 
is only possible with a large enough number of datasets to 
enable statistical comparisons at the dataset level. 


Second, the metrics we examined also only represent a subset 
of many possible. Many other metrics are closely related to 
those we studied (e.g., informedness, markedness, balanced 
accuracy), but may not exhibit exactly the same patterns. 
We selected a diverse mix of commonly reported metrics and 
some less-common metrics, all of which have been shown to 
be useful in previous research. 


Third, we explored only four of the most prominent machine 
learning classifiers from among many possible options. We 
chose these classifiers because they are represented in many 
education-related research endeavors, but results for other 
classifiers may differ. Perhaps most importantly, deep neu- 
ral networks are increasingly popular for educational data 
mining research [2, 28, 48, 47, 42], but were not considered 
here. Wrapper feature selection is perhaps less common for 
deep neural networks, given the high computational cost of 
model training, but correctness metrics often play a similar
role in the model selection process for neural networks,
for example, when deciding when to stop training a model.
In future work we will explore issues of model selection for 
neural networks as well. 


Fourth, averaging across the four classifiers is a limitation 
as well. While classifiers performed somewhat similarly, Figure 2
shows some exceptional cases. For example, kappa performed
poorly with random forest, and precision performed
well with logistic regression. As part of future work, we will 
explore classifier-based analysis of metrics in more depth, 
including statistical analyses (e.g., Friedman test) where we 
consider a large number of classifiers as judges that are rank- 
ing metrics. 

Table 3: Mean predicted rate of the positive class for models with features selected based on each metric, for
each dataset. Base rate indicates the actual proportion of the positive class in the dataset. The last row refers
to the mean absolute difference between predicted rate and base rate across datasets. Green highlighting
indicates the closest match to the true base rate, while red indicates the predicted rate furthest away in each
row.

Dataset Base rate Accuracy AUC F1 Kappa MCC MPAUC Precision RMSE Recall 
CTA-C 0.068 0.060 0.179 0.170 0.169 0.215 0.250 0.149 0.111 0.695 
CTA-PF 0.068 0.029 0.084 0.084 0.084 0.084 0.087 0.003 0.064 0.085 
VIDEO-ANU-C 0.776 0.612 0.502 0.591 0.562 0.534 0.554 0.353 0.533 0.557 
VIDEO-HR-C 0.776 0.669 0.688 0.669 0.724 0.705 0.703 0.681 0.753 0.663 
VIDEO-LBP-C 0.776 0.631 0.526 0.586 0.617 0.563 0.548 0.385 0.588 0.607 
VIDEO-ANU-R 0.733 0.610 0.637 0.567 0.594 0.616 0.611 0.581 0.657 0.568 
VIDEO-HR-R 0.733 0.718 0.658 0.703 0.690 0.705 0.683 0.627 0.732 0.702 
VIDEO-LBP-R 0.733 0.590 0.614 0.529 0.610 0.572 0.560 0.389 0.629 0.590 
EPM 0.237 0.312 0.405 0.321 0.3809 0.319 0.331 0.214 0.315 0.585 
MATH 0.410 0.373 0.505 0.627 0.491 0.537 0.552 0.120 0.484 0.730 
PORTUGUESE 0.425 0.437 0.524 0.609 0.509 0.573 0.532 0.085 0.473 0.840 
Mean |Δ| 0.079 0.126 0.135 0.098 0.123 0.128 0.210 0.080 0.233


Table 4: Number of features in each dataset (N) and mean number of features selected by each metric. The 
highest number of selected features for each dataset is highlighted in light blue, while the lowest is highlighted
in gray.

Dataset N Accuracy AUC F1 Kappa MCC MPAUC Precision RMSE Recall 
CTA-C 25 2.531 10.313 8.781 8.750 5.750 4.094 7.188 12.969 2.875 
CTA-PF 60 10.469 35.219 25.125 24.625 26.125 28.156 1.000 28.094 27.969
VIDEO-ANU-C 42 3.656 5.031 3.875 4.500 4.531 4.625 3.219 5.438 4.031 
VIDEO-HR-C 7 3.313 3.188 2.594 3.563 3.875 3.531 2.750 4.094 2.969 
VIDEO-LBP-C 2304 3.563 6.344 4.031 5.750 5.781 4.844 2.000 7.656 3.625 
VIDEO-ANU-R 42 4.281 6.063 5.063 5.594 4.906 4.813 3.875 7.688 4.281 
VIDEO-HR-R 7 3.531 3.563 3.031 3.938 3.781 3.563 3.938 4.906 2.719
VIDEO-LBP-R 2304 8.344 12.656 6.500 9.125 9.500 9.469 6.500 16.781 4.750 
EPM 38 6.375 6.344 6.219 7.125 6.688 5.813 7.156 7.406 1.000 
MATH 43 7.000 8.719 6.438 8.656 7.781 7.781 4.313 8.250 1.375 
PORTUGUESE 43 10.844 12.625 7.719 11.344 9.531 9.719 3.969 12.469 1.313 
Mean 446.818 5.810 10.006 7.216 8.452 8.023 7.855 4.173 10.523 5.173


7. CONCLUSION


As the field of educational data mining develops, and ma- 
chine learning becomes increasingly popular for modeling 
student outcomes, it is imperative to deeply understand each 
step of the process and the influence researchers’ choices 
have on models. Our experiments offer insight into the large 
differences that can arise from machine learning design de- 
cisions, specifically for feature selection. We showed that 
selecting features based on some metrics is rarely advisable 
(especially precision), and that the choice of metric has im- 
pacts not only on correctness measures but on other impor- 
tant properties of the resulting models, including calibration 
and size (number of features). 


We found that MCC produced the overall best results across 
the other metrics in terms of mean ranking as a measure of 
well-rounded correctness across metrics. MCC was not the 
best selection metric for all the datasets; in fact, it was the 
most effective only for 2 of the 11 datasets we analyzed in 
this study. However, it was more consistently well-ranked 
than the other metrics. On the other hand, RMSE produced 
the best-calibrated models, which can also be an important 
consideration for applying student models that might benefit 
from easily-adjustable decision thresholds. 


Student models are the driving forces in adaptive learning 
software. Thus, enhancing them will lead to better software 
for students and teachers. The results of this project will 
enable researchers to more accurately build models which 
predict student outcomes by informing the correctness met- 
rics relied upon for feature selection. In particular, we sug- 
gest utilizing metrics like MCC and RMSE (if calibration is 
desirable) to yield models with well-rounded accuracy across 
metrics. We suggest avoiding recall, precision, and accuracy, 
even though accuracy is the default setting in some machine 
learning software. 


APPENDIX


A. OVERALL CORRECTNESS AND COM- 
PARISON TO PREVIOUS RESULTS 


In this appendix we provide the correctness results obtained 
from our experiments, as a point of comparison to previous 
work and for comparison in future work. We report results 
where we selected features via MCC, since that metric had 
the best mean ranking in terms of other metrics (Table 1). 
We report only macro-level averaging, though results were 
similar for micro-level averaging (Figure 3). We also average 
results across all four classifiers, rather than selecting only 
the best classifier (and thus potentially introducing Type I 
error). 


Table 5 shows the overall correctness metrics, with compar- 
isons to previous work (where possible) noted by highlighted 
colors. For MATH and PORTUGUESE datasets we could not 
make direct comparisons because previous results did not 
perform median splitting to transform regression to binary 
classification. The feature selection criterion (which metric 
was used for selection) used by previous analysis is not clear 
in most cases. Hence, it is difficult to make close compar- 
isons to previous models. 


For the VIDEO-* datasets we compared AUC to [31]. In
[31], AUCs reported were VIDEO-ANU-C: 0.635, VIDEO-HR-C: 
0.544, VIDEO-LBP-C: 0.645, VIDEO-ANU-R: 0.666, VIDEO-HR- 
R: 0.590 and VIDEO-LBP-R: 0.644. 


For the CTA-C dataset we compared AUC and kappa to [36].
However, the other models in that paper include feature-level
fusion of both CTA-C and CTA-PF features, so they are not
directly comparable to the CTA-PF features that we have.
Reported values for CTA-C were AUC = 0.865 and kappa =
0.332. 


For the EPM dataset we compared results to [22], which 
reports accuracy, F1, kappa, RMSE, precision and recall. 
However, accuracy, F1, precision and recall are reported 
for the random division and the alpha-investing feature se- 
lection methods and hence are not comparable to our re- 
sults. The values (averaged across the reported models) were 
kappa = 0.443 and RMSE = 0.490, though the division of 
grades into two categories may have been based on a differ- 
ent split value than we utilized in this paper (the median), 
so comparisons should be made with that in mind. 


Table 5: Our results for all metrics and datasets using MCC as the selection metric and macro-level averaging. 
Where previous results are known, green highlighting indicates that models we trained were better (more 
accurate) and red indicates that they were worse. Specific previous results are reported in the Appendix
text.

Dataset Accuracy AUC F1 Kappa MCC MPAUC Precision RMSE Recall
CTA-C 0.819 0.874 0.368 0.295 0.363 0.796 0.242 0.361 0.770 
CTA-PF 0.919 0.740 0.466 0.423 0.427 0.735 0.425 0.357 0.521 
VIDEO-ANU-C 0.501 0.500 0.588 -0.004 -0.010 0.491 0.765 0.536 0.516 
VIDEO-HR-C 0.654 0.565 0.762 0.102 0.107 0.554 0.801 0.483 0.738 
VIDEO-LBP-C 0.534 0.511 0.633 0.006 0.002 0.500 0.773 0.518 0.570 
VIDEO-ANU-R 0.558 0.552 0.636 0.035 0.039 0.521 0.747 0.514 0.585 
VIDEO-HR-R 0.622 0.5386 0.729 0.072 0.080 0.539 0.750 0.496 0.727 
VIDEO-LBP-R 0.560 0.568 0.646 0.067 0.076 0.545 0.758 0.508 0.577 
EPM 0.871 0.915 0.764 0.678 0.695 0.882 0.667 0.322 0.901 
MATH 0.619 0.667 0.599 0.252 0.269 0.632 0.532 0.507 0.706 
PORTUGUESE 0.656 0.722 0.659 0.331 0.362 0.675 0.571 0.491 0.797 